# Cross-modal Alignment

Vit So400m Patch16 Siglip 256.webli I18n

A Vision Transformer model based on SigLIP for image feature extraction, retaining the original attention pooling mechanism.

Image Classification

Vit Large Patch14 Clip 224.datacompxl

A Vision Transformer model based on the CLIP architecture, designed for image feature extraction and released by LAION.

Image Classification

Mblip Bloomz 7b

mBLIP is a multilingual vision-language model based on the BLIP-2 architecture, supporting image caption generation and visual question answering tasks in 96 languages.

Transformers, Supports Multiple Languages

© 2025 AIbase